Wild orangutans can simultaneously use two independent vocal sound sources similarly to songbirds and human beatboxers

Abstract
Speech is among the most complex motoric tasks humans ever perform. Songbirds match this achievement during song production through the precise and simultaneous motor control of two sound sources in the syrinx. Integrated and intricate motor control has made songbirds comparative models par excellence for the evolution of speech; however, phylogenetic distance from humans prevents an improved understanding of the precursors that, within the human lineage, drove the emergence of advanced vocal motor control and speech. Here, we report two types of biphonic call combination in wild orangutans that articulatorily resemble human beatboxing and that result from the simultaneous exercise of two vocal sound sources: one unvoiced source achieved through articulatory maneuvering of the lips, tongue, and jaw as typically used for consonant-like call production, plus one voiced source achieved through laryngeal action and voice activation as typically used for vowel-like call production. Orangutan biphonic call combinations showcase unappreciated levels of, and distinct neuromotor channels for, vocal motor control in a wild great ape, providing a direct vocal motor analogy with birdsong based on the precise and simultaneous co-control of two sound sources. Findings suggest that speech and human vocal fluency likely built upon complex call combination, coordination, and coarticulation capacities that involved vowel-like and consonant-like calls in an ancestral hominid.


Introduction
The neurobiology of speech production involves some of the most complex choreographed movements performed by humans. In nonhuman animals, similar degrees of vocal motoric coordination are notoriously accomplished by songbirds, namely, through the precise motor control of two sound sources in the passerine syrinx during song production (1,2), each sound source equivalent to one typical set of vocal folds in mammals. Vocal analogy between humans and songbirds (3) cannot, however, elucidate shared ancestry, rendering bird-human functional parallels mute about how or why complex vocal control was implemented by hominid brains in hominid bodies in the process of human ancestors becoming language-able.
Even under the most conservative hypothetical scenario, where spoken language might be considered to have come into existence anew and abruptly in modern humans, one would still expect it to have been directly or indirectly leveraged on the neuromotoric mechanisms already present in prelinguistic hominid ancestors. To borrow the simile of the saltationist hypothesis (4), in which language was ignited as if by a lightning strike: lightning can only kindle a fire when combustible materials are present where it strikes. Reconstructing the forerunning system of speech among ancestral hominids (5,6) shall, therefore, remain imperative for explaining why spoken language emerged in the human lineage, irrespective of one's theoretical leanings or alliances.
Here, we report two outstanding vocal behaviors in a wild great ape that represent direct analogies with birdsong production based on two sound sources. This evidence sheds new light on the putative vocal range and complexity of ancient hominids. Furthermore, it may finally allow a reading of birdsong neurobiology from a hominid lens, enabling the exchange of bird-ape comparative knowledge, helping formulate cross-taxa predictions, and ultimately, accelerating the reconstruction of the evolutionary precursors of speech within the Hominid family.

Results
We observed two types of biphonic call production in wild orangutans (Pongo spp.), both involving voiceless consonant-like call production (7) "in combination and overlap with" (hereafter "+") voiced vowel-like call production. The voiceless and voiced components involved supralaryngeal maneuvering (i.e. of lips, tongue, and/or jaws) and laryngeal action (i.e. vocal fold oscillation as activated by egressive airflow), respectively. They were, thus, distinct from cases of biphonation that result from nonlinear oscillations in the mammalian larynx (8) or that are achieved by merging two formants, as in human throat singing (9).
Simultaneous voiceless + voiced call production involved four known and previously described call types of the orangutan repertoire, which can be produced solo or in combination with other calls (10). They are presumed present across wild populations (10) and do not represent obvious cases of local-specific call types, such as other described orangutan vocal traditions (11).
The first biphonic call combination was composed of "chomps + grumbles" and produced by Bornean flanged male orangutans (Pongo pygmaeus) in disturbance and precombat contexts (10) at Tuanan, Central Kalimantan, Indonesian Borneo (see Audio S1). Across 2,510 observation hours, 30 chomps were recorded from two flanged males, and 111 grumbles were recorded from seven flanged males. Among these, there were 16 instances of chomps + grumbles combinations by two flanged males (Zeke and Kay).
The second biphonic call combination was composed of "kiss-squeaks + rolling calls" and produced by Sumatran adult female orangutans (Pongo abelii) in the context of alarm vocal responses toward predator models at Ketambe, Aceh, Sumatra, Indonesia (12) (see Audio S2). Across 1,287 observation hours, 293 instances of kiss-squeaks + rolling call combinations were recorded from five females (Chris, Elisa, Puji, Sina, and Yet) among a total of 1,176 kiss-squeaks and 1,158 rolling calls from seven adult females.
Figure 1 depicts the spectrographic representation of exemplar cases of the two identified biphonic call combinations (Fig. 1A and D), with inspection by ear providing a clear sense of the dual sound sources of these combinations (see Audios S1 and S2). Given the different nature of the two sound sources (one in the mouth and one in the larynx), the result was two concurrent, yet distinct sound profiles for each biphonic call combination. Voiceless calls exhibited a higher register in the frequency spectrum and voiced calls a lower register. For "chomps + grumbles" combinations, chomps represent "bubbly" calls and exhibit elements that ascend along a frequency slope between ∼300 and ∼600 Hz for ∼0.20 s (10), whereas grumbles resemble a starting engine and exhibit strings ∼1.0 s long of staccato elements at ∼185 Hz (10), each element lasting ∼0.045 s. Note the independent production and coordination of three chomps along two grumbles (Fig. 1A).
Kiss-squeaks + rolling calls exhibited a more marked contrast between the two sources, where kiss-squeaks (articulatorily and acoustically homologous to a human kiss-sound) exhibit a stark onset with a noisy follow-through composed of a wide band of acoustic energy spreading ∼3,850 Hz for ∼0.375 s (10), whereas rolling calls exhibit elements with acoustic energy concentrated around ∼240 Hz, stringed into sequences ∼0.325 s long (10), where each element is ∼0.03 s long. Note that these biphonic combinations were part of larger call sequences that included other voiced call types, e.g. "grumphs" (10) (Fig. 1D). Spectrogram slice view (Fig. 1D and E) showed two simultaneous frequency maxima for each biphonic call combination, confirming simultaneous production.

Discussion
Findings reveal that complex motor control involving the combination and coarticulation of two sound sources is found in one of our closest living relatives, wild orangutans. We report and describe two cases of biphonic call production in adult males and females across contexts and across different species of Pongo. This confirms that these vocal behaviors have a common biological basis and were not a simple spurious observation.
Due to their intrinsic nature, biphonic call combinations in orangutans imply distinct channels for neuromotor vocal control, demonstrating new and hitherto underestimated vocal control and coordination capacities in a wild great ape. Based on shared ancestry within the Hominid family, this discovery shows that similar homologous vocal behaviors may have been present in an ancient, now-extinct nonhuman hominid ancestor and that similar or equivalent capacities likely propped speech evolution.
A homologous biphonic behavior in humans, involving synchronous supralaryngeal and laryngeal sound production, is found in beatboxing (13). At the same time, orangutan biphonic call combinations are analogous to birdsong, which is also governed by two sound sources (1). These parallels with human and bird vocal expression align with accumulating evidence for advanced vocal control in great apes that challenges traditional assumptions (5,14).
Orangutan biphonic production also seems in part functionally analogous to birdsong and homologous to human beatboxing, where vocal complexity and "exuberance" appear to signal fitness, vigor, and/or condition. For example, chomps + grumbles were produced by male orangutans toward challengers, much as birdsong can be exchanged between male rivals and beatboxing is used in "battles" between vocal performers. This function would also parsimoniously explain kiss-squeaks + rolling calls by females toward predators, with more complex vocal displays predicted to dissuade a predator attack more effectively.
A behavior's function is not per se an indication of its means of acquisition. For example, some birdsong motifs are innate (15), but this does not change the fact that song production is nonetheless motorically complex in these cases. As such, further study will be required to elucidate the development of biphonic production in orangutans. However, homology with human beatboxing and cases of vocal learning in great apes (5,14) hint at a potential role of practice and auditory feedback.
Birdsong represents a case of convergent vocal evolution with humans; however, analogy across two lineages does not inform homology within one. Accordingly, it has so far been unclear how the abundant knowledge gathered over decades on songbird models (3) can be "translated" into the brains and bodies of ancestral hominids for a true-to-life reconstruction of speech evolution. Our findings show that, beyond humans, birdsong is also convergent with some vocal behaviors in great apes. If songbirds' neurobiology for vocal control is analogous to humans' (3), it follows that it must also be analogous to great apes', at least to some degree. This indicates that bird-to-ape exchange of knowledge is possible and desirable. For example, in-depth understanding of birdsong neural and molecular substrates could help develop and test predictions about parallel mechanisms in great ape brains and bodies, avoiding invasive and ethically prohibitive research in great apes. New bird-inferred insight into great ape neurobiology could also guide fieldworkers and primatologists to focus empirical and logistical effort on certain contexts or individual classes. This could potentially help advance more expeditiously the cataloguing of great ape vocal behavior that is currently eroding in the wild (16) and in populations that face increasing extinction risks (17). This concerted effort could assist the identification of further, as yet undetermined, bird-ape vocal analogies.
Our findings align vocal research in birds and hominids (both human and nonhuman) in ways thus far unseen. They invite comparative and cognitive sciences to collaborate and "share brains" in new ways, heralding leaps in bird-ape cross-pollinating studies and the understanding of speech (and potentially song) evolution in the human clade.

Data collection
Data collection involved opportunistic and experimental (i.e. during predator model experiments) audio recordings of orangutan vocal behavior over 2,510 observation hours at Tuanan (2°09′S, 114°26′E) (10) and 1,287 observation hours at Ketambe (3°41′N, 97°39′E) (12). Data collection involved no interaction with or handling of animals. Research was approved by Indonesian authorities and strictly followed Indonesian law and local guidelines.

Data analyses
Spectrograms and spectrogram slices of orangutan biphonic call combinations were built in Raven Pro 1.6 (18) with the following settings: window type, Hann; 3-dB filter bandwidth, 56.3 Hz; grid frequency resolution, 1.46 Hz; and grid time resolution, 1,126 samples, according to published protocols (19). No sound or file transformation was applied, to avoid interpretation or verification issues. To ascertain biphonic vocal production, we identified, audibly and through spectrogram inspection, instances when two sound sources were patently present.
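For readers without access to Raven Pro, the kind of Hann-windowed spectrogram described above can be approximated with a short NumPy sketch. This is an illustration only and not the authors' pipeline: the function names `hann_spectrogram` and `hann_3db_bandwidth_hz` are hypothetical, and because the recording sampling rate is not stated in this section, the sample rate, window length, hop, and DFT size are all placeholders to be supplied by the user. The second function estimates numerically how the 3-dB bandwidth of a Hann window depends on window length and sampling rate, which is the quantity the settings above report in Hz.

```python
import numpy as np


def hann_spectrogram(signal, sample_rate, window_len, hop_len, fft_len):
    """Hann-windowed short-time power spectrogram (illustrative sketch).

    Returns (freqs_hz, times_s, power_db); power_db has shape
    (n_frequencies, n_frames).
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < window_len:
        raise ValueError("signal is shorter than one analysis window")
    if fft_len < window_len:
        raise ValueError("fft_len must be at least window_len")
    window = np.hanning(window_len)
    n_frames = 1 + (len(signal) - window_len) // hop_len
    # Cut the signal into overlapping frames and taper each with the window.
    frames = np.stack([
        signal[i * hop_len:i * hop_len + window_len] * window
        for i in range(n_frames)
    ])
    # Zero-padding to fft_len refines the frequency grid spacing only.
    power = np.abs(np.fft.rfft(frames, n=fft_len, axis=1)) ** 2
    power_db = 10.0 * np.log10(power + 1e-12)  # small offset avoids log(0)
    freqs_hz = np.fft.rfftfreq(fft_len, d=1.0 / sample_rate)
    times_s = (np.arange(n_frames) * hop_len + window_len / 2.0) / sample_rate
    return freqs_hz, times_s, power_db.T


def hann_3db_bandwidth_hz(sample_rate, window_len, oversample=64):
    """Numerically estimate the 3-dB (half-power) bandwidth of a Hann window."""
    window = np.hanning(window_len)
    n = window_len * oversample
    response = np.abs(np.fft.rfft(window, n=n))
    response /= response[0]  # normalize to the main-lobe peak at 0 Hz
    # First frequency sample where the amplitude drops below 1/sqrt(2).
    first_below = int(np.argmax(response < 1.0 / np.sqrt(2.0)))
    return 2.0 * first_below * sample_rate / n
```

In this sketch, the hop length plays the role of a time grid spacing and `sample_rate / fft_len` the role of a frequency grid spacing; a longer window narrows the 3-dB bandwidth at the cost of temporal detail.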
Biphonic call combination was subsequently confirmed by identifying the power for each independent sound source in the spectrogram slice view during moments of simultaneous sound production. This was possible because consonant-like and vowel-like calls are inherently underpinned by distinct production mechanisms and thus produce distinct acoustic profiles. The original sound files can be found as supplementary materials to facilitate open and transparent inspection of biphonic call production. Given that laryngeal voice production is not patently detectable through video (e.g. as opposed to lip action to produce consonant-like calls), sound analyses provided the most reliable means to establish biphonism. Other setups for the detection of sound location in living mammals require animals of substantial size (e.g. elephants) in captivity (20). Similar setups (e.g. involving an array of 48 microphones arranged in a specific shape around a central video camera) would be practically impossible in a rainforest, with an arboreal species, with an animal with the body size of an ape and/or at a distance.
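The logic of the spectrogram-slice check described above (two concurrent power maxima, one per sound source) can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions, not the procedure implemented in Raven Pro: all function names are hypothetical, the frequency bands and the 20-dB criterion are arbitrary placeholders, and the test signal is purely synthetic (two sine tones plus faint noise standing in for a low voiced source and a higher voiceless source).

```python
import numpy as np


def spectrum_slice(signal, sample_rate, start, window_len, fft_len):
    """Power spectrum of one Hann-windowed frame beginning at sample `start`."""
    frame = np.asarray(signal[start:start + window_len], dtype=float)
    frame = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(frame, n=fft_len)) ** 2
    freqs_hz = np.fft.rfftfreq(fft_len, d=1.0 / sample_rate)
    return freqs_hz, power


def band_peak(freqs_hz, power, low_hz, high_hz):
    """Return (peak_frequency_hz, peak_power) inside [low_hz, high_hz]."""
    idx = np.flatnonzero((freqs_hz >= low_hz) & (freqs_hz <= high_hz))
    best = idx[np.argmax(power[idx])]
    return freqs_hz[best], power[best]


def has_two_sources(freqs_hz, power, low_band, high_band, min_snr_db=20.0):
    """True if both bands hold a peak at least min_snr_db above the median power.

    The median of the slice serves as a crude noise-floor estimate
    (an assumption of this sketch).
    """
    floor = np.median(power) + 1e-20
    checks = []
    for low_hz, high_hz in (low_band, high_band):
        _, peak_power = band_peak(freqs_hz, power, low_hz, high_hz)
        checks.append(10.0 * np.log10((peak_power + 1e-20) / floor) >= min_snr_db)
    return all(checks)


def synthetic_call(sample_rate, duration_s, tones, noise_std=0.01, seed=0):
    """Sum of sine tones [(frequency_hz, amplitude), ...] plus white noise."""
    t = np.arange(int(sample_rate * duration_s)) / sample_rate
    rng = np.random.default_rng(seed)
    signal = noise_std * rng.standard_normal(t.size)
    for frequency_hz, amplitude in tones:
        signal = signal + amplitude * np.sin(2.0 * np.pi * frequency_hz * t)
    return signal
```

A possible use: build `synthetic_call(16000, 0.5, [(240, 1.0), (3850, 0.5)])`, take one slice with `spectrum_slice`, and pass it to `has_two_sources` with a low band such as (100, 500) Hz and a high band such as (2000, 6000) Hz. On real recordings, the bands and threshold would need to be chosen from the call types under study, and a positive result would still warrant inspection by ear and by eye.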
For information about acoustic variation in the call types comprising biphonic combinations, including individual, age-sex, contextual, and geographic acoustic variations, please see Hardus et al. and Lameira et al. (10,19). Please note that these levels of variation carry no theoretical or mechanical consequence on the demonstration of biphonism (e.g. a call type may be produced biphonically whether it exhibits geographic variation or not).